Logging for Uncoordinated Checkpointing Protocols Achour
Authors
Abstract
A message is in transit with respect to a global state if its sending is recorded in that global state while its receipt is not. Checkpointing algorithms have to log such in-transit messages in order to restore the state of the channels when a computation has to be resumed from a consistent global state after a failure. Coordinated checkpointing algorithms log exactly those in-transit messages on stable storage. Because they lack synchronization, uncoordinated checkpointing algorithms conservatively log more messages. This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non-in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. Appropriate tracking of causal dependencies between messages constitutes the core of the protocol.
French title: Enregistrement sélectif de messages lors d'une définition non coordonnée de points de contrôle (Selective message logging in uncoordinated checkpointing).
Résumé (translated from the French): Checkpointing algorithms require certain messages, called in-transit messages, to be logged so that the state of the channels can be restored during rollback recovery. A message is in transit if its sending is recorded in the global state of the application while its receipt is not. Coordinated algorithms log on stable storage exactly the messages that are in transit with respect to the computed global state, whereas uncoordinated algorithms log all messages. This paper first defines a class of messages that cannot be in transit in any consistent global state, and then proposes an algorithm that does not log such messages and thereby reduces stable storage usage.
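The definition of an in-transit message can be made concrete with a small sketch. This is an illustration of the definition only, not the protocol proposed in the paper. The names `Message`, `is_recorded`, `in_transit`, `is_consistent`, and the sample data are hypothetical, and a global state is assumed to be modelled as a map from each process to the index of the last local event it records.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; event indices are local to each process.
    name: str
    sender: str
    send_event: int             # index of the send event on the sender
    receiver: str
    recv_event: Optional[int]   # index of the receive event, None if never delivered


# Assumed model of a global state: process -> index of the last recorded local event.
Cut = Dict[str, int]


def is_recorded(cut: Cut, process: str, event: Optional[int]) -> bool:
    """An event is recorded if it happened at or before the process's cut point."""
    return event is not None and event <= cut[process]


def in_transit(messages: List[Message], cut: Cut) -> List[Message]:
    """Messages whose sending is recorded in the cut while their receipt is not."""
    return [
        m for m in messages
        if is_recorded(cut, m.sender, m.send_event)
        and not is_recorded(cut, m.receiver, m.recv_event)
    ]


def is_consistent(messages: List[Message], cut: Cut) -> bool:
    """A cut is consistent if no receipt is recorded without the matching send."""
    return not any(
        is_recorded(cut, m.receiver, m.recv_event)
        and not is_recorded(cut, m.sender, m.send_event)
        for m in messages
    )


# Sample two-process execution (made up for illustration).
MESSAGES = [
    Message("m1", "P1", 1, "P2", 2),
    Message("m2", "P2", 3, "P1", 4),
    Message("m3", "P1", 2, "P2", 5),
]

if __name__ == "__main__":
    cut = {"P1": 3, "P2": 2}
    print(is_consistent(MESSAGES, cut))
    print([m.name for m in in_transit(MESSAGES, cut)])
```

With the cut `{"P1": 3, "P2": 2}`, only `m3` has to be logged to restore the channel state: `m1` is fully inside the cut and `m2` is fully outside it. Moving the cut on `P1` to event 4 records the receipt of `m2` without its sending, so that cut is no longer consistent.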
Similar resources
Efficient Message Logging for Uncoordinated Checkpointing Protocols
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers...
Defining the Checkpoint Interval for Uncoordinated Checkpointing Protocols
Parallel applications running on large computers suffer from the absence of a reliable environment. Fault tolerance proposals, in general, rely on rollback-recovery strategies supported by checkpoint and/or message logging. There are well-defined models that address the optimum checkpoint interval for coordinated checkpointing. Nevertheless, there is a lack of models concerning uncoordinated ch...
Semantics of recovery lines for backward recovery in distributed systems
This paper addresses the definition of recovery lines in the context of backward recovery, whose aim is to cope with failures in distributed systems. A general framework that allows for several semantics of recovery lines is introduced. Key notions such as missing messages and orphan messages are precisely defined and their impact on the definition of consistency of recovery lines is carefully an...
Coordinated Checkpoint versus Message Log for Fault Tolerant MPI
MPI is one of the most widely adopted programming models for large cluster and Grid deployments. However, these systems often suffer from network or node failures. This raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent approaches are based on either coordinated checkpointing or message logging associated with uncoordinated checkpointing. There are many protocols, imple...
On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications
Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hi...
Publication date: 1996